Local Aware IBM#1378
Conversation
… are only locally aware
|
Claude Code Review Head SHA: 9879df3 Files changed:
Findings: Critical: Stack overflow in
|
…s param description
Re-run ffmt with the corrected 0.4.0 build that includes the breaks.retain filter preventing line splits immediately before `%` (member accessor). Fixes three occurrences of `& %sf` in m_ibm.fpp, one `& %geometry` in m_ibm.fpp, and one `& %beg` in m_time_steppers.fpp.
num_ib_patches_max = 2050000 caused a ~2.25 GB unconditional heap allocation + full init loop in s_assign_default_values_to_user_inputs on every startup, crashing the CI debug build even for cases with no IBM. Also, s_reduce_ib_patch_array had a 2.25 GB stack-allocated local array that caused SIGSEGV for IBM cases. Introduce num_ib_patches_max_namelist = 50000 (restoring the pre-particle-bed budget) for the initial allocation. s_generate_particle_beds grows patch_ib to num_ib_patches_max via MOVE_ALLOC only when particle beds are actually being generated. s_reduce_ib_patch_array now uses a heap-allocated local array sized to num_ibs instead of the full num_ib_patches_max.
When particle beds are used, rank 0 grows patch_ib to num_ib_patches_max inside s_generate_particle_beds. Non-rank-0 ranks only have the namelist-sized allocation (num_ib_patches_max_namelist). The num_ibs scalar is broadcast first, so we can check and grow before the per-patch MPI_BCAST loop accesses patch_ib(i) for i > num_ib_patches_max_namelist.
|
You had an issue that I think I fixed.
|
… limit NIB and the case_validator both now reference num_ib_patches_max_namelist (50000) instead of num_ib_patches_max (2050000). This constant is the actual namelist limit; particle beds grow patch_ib beyond it at runtime but those entries are never specified in the namelist. The fallback values match the constant, ensuring Homebrew installs (which lack m_constants.fpp) use the correct limit.
patch_ib is reallocated to num_aware_ibs slots (e.g. 54000 for 3D with the default ib_neighborhood_radius=1) before the 1-rank and no-MPI copy. Using patch_ib(1:num_gbl_ibs) crashed when num_gbl_ibs > num_aware_ibs (e.g. large particle beds on a single MPI rank). Use min() to clamp the copy, matching the original truncation behavior while avoiding the out-of-bounds write on the newly allocatable patch_ib_gbl.
For single-rank and no-MPI cases every IB patch is local, so shrinking patch_ib to num_aware_ibs (e.g. 54k for 3D) and then keeping num_ibs at num_gbl_ibs was a latent out-of-bounds: IBM loops over num_ibs would access patch_ib beyond its capacity. MOVE_ALLOC transfers patch_ib_gbl (sized exactly num_gbl_ibs) directly into patch_ib — no copy, no truncation, patch_ib size matches num_ibs. The num_aware_ibs resize is still done for multi-rank cases where each rank genuinely only needs its local neighbourhood subset.
patch_ib has GPU_DECLARE(create=...) in m_global_parameters, which means OpenACC/OpenMP tracks it via plain allocate/deallocate automatically. move_alloc is not reliably intercepted by all GPU runtimes for declare- create variables, so replace all three grow-on-demand sites (m_particle_bed, m_mpi_proxy, s_reduce_ib_patch_array 1-rank/no-MPI) with the established pattern: allocate tmp, copy, deallocate patch_ib, allocate patch_ib.
|
I generated and pushed the missing golden file for While reviewing the diff I found several other issues worth addressing: Critical (wrong results)
High (MPI / GPU correctness)
Post-process SILO writes called from all ranks ( Medium (robustness / performance)
Low (code quality)
The two highest-priority fixes are the |
|
Addressing AI comments: All of the critical issues are incorrect for one reason or another. I have gone through and verified that the relevant chunks of code are in place to address the concerns it has. As for the neighbor boundaries, they are never called in a GPU region, so this comment is mute. I have removed the GPU routine call to help with compilation in response. The file per process F claim is just false, and I tested that the code runs find without it. But it is true that we do not need other ranks to be doing any of this calculation regardless. I extended the rank 0 check in response. nreqs is in fact reset between loop iterations. Snapshot arrays are in fact reset between loop iterations It is confused about the need to GPU allocate arrays for patch_ib. Because the array is used with a declare, but never actually allocated on the GPU, we just have the GPU pointer. When we perform the first copy to the GPU to allocate the array size, we have already done the reshape. This means that the GPU array is properly sized. This is important because the size of the patch_ib array is significant in the new code and could causes issues with limiting problem size because the GPU must be able to hold the global patch_ib array, even though it would then immediately be deallocated and reallocated to a smaller array. If the number of ib patches grows to the extent that it costs many for GB of memory, this could cause memory limitations. The current implementation should be the correct way to go. the moving IBM flag should not be recomputed. You want it to be the same on all ranks. There is argument that it should be moved to the case level for case optimizaiton later, but that is a future PR. Force arrays getting moved is going to be handled in a future PR that already has a github issue I created yesterday. Prohibit for number of local IBs resolved. In summary, two of these I made minor adjustments for. One I reject for reasons that I believe are valid and better for long-term code support. All others are incorrect. |
|
General Note: Mac errors are on sections of the code that were not touched by this PR (bubbles), and it is a random dyl linker error. I get the same error on reldebug on a chemsitry case. I get the same error locally, but on a different set of tests. This seems like a MacOS instability issue, and not related to the particular PR. Going to attempt to rerun. If ti fails again, but on different tests, then it looks like the MacOS test suite checks may be broken. Edit: Rerunning the job changed the jobs which failed. This appears to be a spurious failure for MacOS specifically. It appears to be specific to the branch, as other branch are not experiencing it. Will investigate. |
Description
The current code has an issue scaling much past 10k particles due to limitations in the MPIAllReduceSum in the IB force computation. This PR attempts to alleviate this by limiting the number of IBs any given rank can be aware of to its neighbors. This turns the AllReduce compute to a MPI neighbor computation, removing the communication bottlneck. To support this, a massive overhaul of IB ownership between ranks was required.
Type of change
Testing
TBD
Checklist
AI code reviews
Reviews are not triggered automatically. To request a review, comment on the PR:
@coderabbitai review— incremental review (new changes only)@coderabbitai full review— full review from scratch/review— Qodo review/improve— Qodo code suggestions@claude full review— Claude full review (also triggers on PR open/reopen/ready)claude-full-review— Claude full review via label